The Development of Text-Mining Tools and Algorithms

Author

  • Daniel Waegel
Abstract

This paper describes the first version of the TextMOLE (Text Mining Operations Library and Environment) system for textual data mining. Currently TextMOLE acts much like an advanced search engine: it parses a data set, extracts relevant terms, and allows the user to run queries against the data. The system design is open-ended, robust, and flexible. The tool is designed as a utility for quickly analyzing a corpus of documents and determining which parameters will provide maximal retrieval performance. Thus an instructor can use the tool to demonstrate artificial intelligence concepts in the classroom, or to encourage hands-on exploration of the concepts often covered in an introductory course in information retrieval or artificial intelligence. Researchers will find the tool useful when a 'quick and dirty' analysis of an unfamiliar collection is required. In addition to discussing TextMOLE, this paper describes an algorithm that was implemented and tested using TextMOLE as a platform. The most common retrieval systems run queries independently of one another; no data about the queries is retained from query to query. This paper describes an unsupervised learning algorithm that uses information about previous queries to prune new query results. The algorithm has two distinct phases. The first trains on a batch of queries; in doing so, it learns about the collection and the relationships between its documents. It stores this information in a document-by-document matrix. The second phase uses the accumulated knowledge to prune query results. Documents are removed from query results based on their learned relationship to documents at the top of the results list. The algorithm can be tuned to be very aggressive or more conservative in its pruning. This algorithm increases the relevancy of the results and significantly reduces the size of the results list.
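
The two-phase idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the document-by-document matrix is filled or how the pruning decision is made, so everything specific here is an assumption made for illustration: the matrix counts how often two documents appear together in the top `top_k` results of a training query, and a lower-ranked document is pruned when its total association with the top `anchors` documents falls below `threshold`. The function names `train`, `association`, and `prune` are hypothetical and are not part of TextMOLE.

```python
from collections import defaultdict
from itertools import combinations


def train(batch_results, top_k=3):
    """Phase 1 (assumed form): build a sparse document-by-document matrix.

    batch_results holds one ranked list of document ids per training query.
    Each cell counts how often two documents co-occur in a top_k result set.
    """
    matrix = defaultdict(int)
    for ranked in batch_results:
        # Sort the ids so each unordered pair maps to a single key.
        for a, b in combinations(sorted(set(ranked[:top_k])), 2):
            matrix[(a, b)] += 1
    return matrix


def association(matrix, a, b):
    """Look up the learned association between two documents."""
    return matrix.get((min(a, b), max(a, b)), 0)


def prune(ranked, matrix, anchors=2, threshold=1):
    """Phase 2 (assumed form): prune results using the learned matrix.

    The top `anchors` documents are always kept. A lower-ranked document
    stays only if its summed association with them reaches `threshold`.
    A higher threshold gives more aggressive pruning.
    """
    top = ranked[:anchors]
    kept = list(top)
    for doc in ranked[anchors:]:
        score = sum(association(matrix, doc, t) for t in top)
        if score >= threshold:
            kept.append(doc)
    return kept


# Hypothetical training batch: ranked results of three earlier queries.
batch = [
    ["d1", "d2", "d3", "d9"],
    ["d2", "d1", "d4"],
    ["d5", "d6", "d7"],
]
matrix = train(batch)

new_results = ["d1", "d2", "d5", "d3", "d4", "d6"]
conservative = prune(new_results, matrix, threshold=1)
aggressive = prune(new_results, matrix, threshold=3)
print(conservative)
print(aggressive)
```

In this sketch the single `threshold` parameter plays the role of the aggressive-versus-conservative tuning mentioned in the abstract; the actual algorithm may expose different controls.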

Similar resources

Topic Modeling and Classification of Cyberspace Papers Using Text Mining

Global cyberspace networks provide individuals with platforms where they can interact, exchange ideas, share information, provide social support, conduct business, create artistic media, play games, engage in political discussions, and much more. The term cyberspace has become a conventional means to describe anything associated with the Internet and the diverse Internet culture. In fact, cyberspac...


Designing a System for Trend Analysis of Users in Website Surfing in Iran Using Data Mining and Text Mining Algorithms

Background and Aim: As web surfing has become part of the lifestyle of the vast majority of people in society, and given the need for more accurate social and cultural policymaking in the field, the authors set out to analyze the behavior of users in viewing different websites so as to help politicians and practitioners. Methods: The design science research method is used in this research...


Image retrieval using the combination of text-based and content-based algorithms

Image retrieval is an important research field which has received great attention in recent decades. In this paper, we present an approach to image retrieval based on the combination of text-based and content-based features. Keywords are used as the text-based features, and color and texture as the content-based features. A query in this system contains some keywords and an input...


Comparing various attributes of prolactin hormones in different species: application of bioinformatics tools

Prolactin is mainly secreted by the anterior pituitary and is able to stimulate mammary gland development and lactation in mammals. Although prolactins share a common ancestral gene encoding, they show species-specific characteristics and their efficiency may differ among mammals. The importance of protein structures of all sequences of this hormone has been studied by various bi...


Detecting Diseases in Medical Prescriptions Using Data Mining Tools and Combining Techniques

Data about the prevalence of communicable and non-communicable diseases, as one of the most important categories of epidemiological data, is used for interpreting health status of communities. This study aims to calculate the prevalence of outpatient diseases through the characterization of outpatient prescriptions. The data used in this study is collected from 1412 prescriptions for various ty...

Publication date: 2006